5. Learning, badly

The time to learn has arrived. Will we succeed? Let's find out.

5.1. Preparing the notebook


In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from matplotlib import cm as cmap

from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder

sns.set(font='sans')

5.2. Learning with random forest

When I had to choose a training algorithm, I didn't know which of the ones I had studied would work best. So I asked my thesis supervisor, Fernando Sancho Ph.D., and his recommendation, based on his experience with related projects, was to use random forests.

The feature_columns variable, shown in the next piece of code, lists the attributes that will be used for prediction.


In [3]:
labelize_columns = ['medallion', 'hack_license', 'vendor_id']

interize_columns = ['pickup_month', 'pickup_weekday', 'pickup_non_working_today', 'pickup_non_working_tomorrow']

feature_columns = ['medallion', 'hack_license', 'vendor_id', 'pickup_month', 'pickup_weekday', 'pickup_day',
                   'pickup_time_in_mins', 'pickup_non_working_today', 'pickup_non_working_tomorrow', 'fare_amount',
                   'surcharge', 'tolls_amount', 'passenger_count', 'trip_time_in_secs', 'trip_distance', 'pickup_longitude',
                   'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']

class_column = 'tip_label'

In [4]:
data = pd.read_csv('../data/dataset/dataset.csv')

Before starting the training, we need to transform the non-numeric attributes into numeric ones so that they can be used with scikit-learn.


In [5]:
for column in labelize_columns:
    real_column = data[column].values
    
    le = LabelEncoder()
    le.fit(real_column)
    labelized_column = le.transform(real_column)
    
    data[column] = labelized_column
    
    le = None
    real_column = None
    labelized_column = None
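
As a quick aside, here is a minimal toy example of what LabelEncoder does: it maps each distinct string to an integer code, where the codes follow the sorted order of the unique values. The values below are illustrative, standing in for a column like vendor_id.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for e.g. 'vendor_id'.
values = ['CMT', 'VTS', 'CMT', 'DDS']

le = LabelEncoder()
codes = le.fit_transform(values)

# Codes follow the sorted order of the unique labels:
# CMT -> 0, DDS -> 1, VTS -> 2.
print(list(le.classes_))  # ['CMT', 'DDS', 'VTS']
print(list(codes))        # [0, 2, 0, 1]
```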

In [6]:
for column in interize_columns:
    data[column] = data[column].astype(int)

Let's start the training! We are going to run 10 iterations of stratified shuffle-split validation (90% train, 10% test) to train random forest models with 256 trees each.


In [7]:
data_features = data[feature_columns].values
data_classes = data[class_column].values

In [8]:
cross_validation = StratifiedShuffleSplit(data_classes, n_iter=10, test_size=0.1, random_state=0)

scores = []
confusion_matrices = []

for train_index, test_index in cross_validation:
    data_features_train, data_classes_train = data_features[train_index], data_classes[train_index]
    data_features_test, data_classes_test = data_features[test_index], data_classes[test_index]
    
    # Note: you need at least 16GB of RAM to train 256 trees on 6 classes.
    # Of course, you can use fewer trees, but performance will gradually degrade.
    clf = RandomForestClassifier(n_estimators=256, n_jobs=-1)
    clf.fit(data_features_train, data_classes_train)
    
    # Saving the scores.
    test_score = clf.score(data_features_test, data_classes_test)
    scores.append(test_score)
    
    # Saving the confusion matrices.
    data_classes_pred = clf.predict(data_features_test)
    cm = confusion_matrix(data_classes_test, data_classes_pred)
    confusion_matrices.append(cm)
    
    clf = None

print('Accuracy mean: ' + str(np.mean(scores)))
print('Accuracy std: ' + str(np.std(scores)))


Accuracy mean: 0.529469
Accuracy std: 0.000902712024956
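
For reference, here is a small sketch of what stratified shuffle splitting does, written against the modern sklearn.model_selection API (which replaced the sklearn.cross_validation module used above): each iteration draws a random test set that preserves the class proportions of the full label vector. The labels here are toy data.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels: 8 samples of class 0, 4 of class 1 (a 2:1 ratio).
y = np.array([0] * 8 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)  # dummy features

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

for train_index, test_index in sss.split(X, y):
    # Each random test set keeps the 2:1 class ratio: 2 zeros, 1 one.
    print(np.bincount(y[test_index]))  # [2 1]
```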

A prediction with an accuracy of 52.95%. What happened?

5.3. What happened?

As I am not a machine learning expert, I'm not 100% sure what caused this bad result. It is an indicator that I still need to study more machine learning theory, something I'm willing to do, spoiler, especially after the results we will obtain in the next notebook.

To try to find the reason for the bad accuracy, let's use another tool for measuring performance: a confusion matrix.
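
As a quick refresher on how to read one: confusion_matrix puts true classes on the rows and predicted classes on the columns, so diagonal entries count correct predictions. A tiny two-class example:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
# Row i, column j counts samples of true class i predicted as class j:
# [[1 1]    one 0 predicted correctly, one 0 predicted as 1
#  [1 2]]   one 1 predicted as 0, two 1s predicted correctly
print(cm)
```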


In [9]:
classes = [' ', '[0-10)', '[10-15)', '[15-20)', '[20-25)', '[25-30)', '[30-inf)']

# Sum the per-split confusion matrices into a single matrix.
cm = np.sum(confusion_matrices, axis=0)

fig, axes = plt.subplots()

colorbar = axes.matshow(cm, cmap=cmap.Blues)
fig.colorbar(colorbar, ticks=[0, 25000, 50000, 75000, 100000, 125000, 150000, 175000, 200000, 225000, 250000])

axes.set_xlabel('Predicted class', fontsize=15)
axes.set_ylabel('True class', fontsize=15)

axes.set_xticklabels(classes)
axes.set_yticklabels(classes)

axes.tick_params(labelsize=12)


This is pretty strange. It looks like almost all the predictions fall into just two of the classes! Let's check how the tip is distributed in the dataset.


In [10]:
tip = data.groupby('tip_perc').size()
tip.index = np.floor(tip.index)

ax = tip.groupby(tip.index).sum().plot(kind='bar', figsize=(15, 5))

ax.set_xlabel('floor(tip_perc)', fontsize=18)
ax.set_ylabel('number of trips', fontsize=18)
ax.tick_params(labelsize=12)

tip = None
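
The binning trick used above, counting trips per exact percentage, then flooring the index and re-grouping, can be seen on a toy frame. The values here are illustrative, standing in for the tip_perc column.

```python
import numpy as np
import pandas as pd

# Toy tip percentages standing in for data['tip_perc'].
df = pd.DataFrame({'tip_perc': [10.2, 10.9, 20.1, 20.5, 20.7]})

tip = df.groupby('tip_perc').size()    # trips per exact percentage
tip.index = np.floor(tip.index)        # collapse index to integer bins

binned = tip.groupby(tip.index).sum()  # trips per whole-percent bin
print(binned.to_dict())  # {10.0: 2, 20.0: 3}
```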


Looking at the previous figure, we can say that the social norm is to tip 20% of the fare. Perhaps that is the real question: whether a tip will be above or below that norm.

To answer that, let's replace the classes with just two:

$$ \text{``}{<}\,20\text{''} \quad \text{and} \quad \text{``}{\geq}\,20\text{''} $$

In [11]:
tip_labels = ['< 20', '>= 20']
tip_ranges_by_label = [[0.0, 20.0], [20.0, 51.0]]

for i, tip_label in enumerate(tip_labels):
    tip_mask = ((data.tip_perc >= tip_ranges_by_label[i][0]) & (data.tip_perc < tip_ranges_by_label[i][1]))
    data.loc[tip_mask, 'tip_label'] = tip_label
    
    tip_mask = None
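
The same mask-based relabelling can be sketched on a toy frame; DataFrame.loc is used for the assignment, which avoids pandas' chained-assignment pitfalls. The frame and values here are illustrative.

```python
import pandas as pd

# Toy frame standing in for the real dataset.
df = pd.DataFrame({'tip_perc': [5.0, 18.0, 20.0, 33.0]})
df['tip_label'] = ''

tip_labels = ['< 20', '>= 20']
tip_ranges_by_label = [[0.0, 20.0], [20.0, 51.0]]

for label, (low, high) in zip(tip_labels, tip_ranges_by_label):
    # Boolean mask selecting the rows whose tip falls in [low, high).
    mask = (df.tip_perc >= low) & (df.tip_perc < high)
    df.loc[mask, 'tip_label'] = label

print(df.tip_label.tolist())  # ['< 20', '< 20', '>= 20', '>= 20']
```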

In [12]:
data.to_csv('../data/dataset/dataset.csv', index=False)

Will this change work? Let's find out in the next notebook.